Loan Data Exploration by Ming Ho

This report explores a dataset containing data such as borrower APR and lender yield for about 114,000 peer-to-peer loans.

Univariate Plots Section

Here is the structure of the dataset.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 8 levels "A","AA","B","C",..: 4 NA 7 NA NA NA NA NA NA NA ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2802 levels "2005-11-25 00:00:00",..: 1137 NA 1262 NA NA NA NA NA NA NA ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 7 levels "A","AA","B","C",..: NA 1 NA 1 5 3 6 4 2 2 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 51 levels "AK","AL","AR",..: 6 6 11 11 24 33 17 5 15 15 ...
##  $ Occupation                         : Factor w/ 67 levels "Accountant/CPA",..: 36 42 36 51 20 42 49 28 23 23 ...
##  $ EmploymentStatus                   : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 706 levels "00343376901312423168731",..: NA NA 334 NA NA NA NA NA NA NA ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11585 levels "1947-08-24 00:00:00",..: 8638 6616 8926 2246 9497 496 8264 7684 5542 5542 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

The data set contains 81 variables, with about 114,000 observations. I will only be looking at the following 18 variables.

Here are some basic summary statistics for the selected variables.

##       Term                       LoanStatus     BorrowerRate   
##  Min.   :12.00   Current              :51174   Min.   :0.0400  
##  1st Qu.:36.00   Completed            :17690   1st Qu.:0.1349  
##  Median :36.00   Chargedoff           : 4445   Median :0.1845  
##  Mean   :42.76   Defaulted            :  885   Mean   :0.1936  
##  3rd Qu.:60.00   Past Due (1-15 days) :  714   3rd Qu.:0.2549  
##  Max.   :60.00   Past Due (31-60 days):  322   Max.   :0.3600  
##                  (Other)              :  994                   
##   LenderYield     EstimatedReturn    ProsperRating..Alpha.  ProsperScore  
##  Min.   :0.0300   Min.   :-0.18160   A :13491              Min.   : 1.00  
##  1st Qu.:0.1249   1st Qu.: 0.07408   AA: 5097              1st Qu.: 4.00  
##  Median :0.1745   Median : 0.09110   B :14379              Median : 6.00  
##  Mean   :0.1836   Mean   : 0.09553   C :16501              Mean   : 6.08  
##  3rd Qu.:0.2449   3rd Qu.: 0.11500   D :12631              3rd Qu.: 8.00  
##  Max.   :0.3400   Max.   : 0.26670   E : 8443              Max.   :11.00  
##                                      HR: 5682                             
##  ListingCategoryNumber               Occupation         EmploymentStatus
##  Min.   : 0.000        Other              :18501   Employed     :65884  
##  1st Qu.: 1.000        Professional       : 9917   Full-time    : 7584  
##  Median : 1.000        Executive          : 3206   Other        : 2194  
##  Mean   : 3.302        Computer Programmer: 3038   Retired      :  320  
##  3rd Qu.: 3.000        Teacher            : 2777   Part-time    :  199  
##  Max.   :20.000        Analyst            : 2683   Self-employed:   42  
##                        (Other)            :36102   (Other)      :    1  
##  OpenCreditLines  CurrentDelinquencies DebtToIncomeRatio
##  Min.   : 0.000   Min.   : 0.0000      Min.   : 0.000   
##  1st Qu.: 6.000   1st Qu.: 0.0000      1st Qu.: 0.150   
##  Median : 9.000   Median : 0.0000      Median : 0.220   
##  Mean   : 9.602   Mean   : 0.3303      Mean   : 0.258   
##  3rd Qu.:12.000   3rd Qu.: 0.0000      3rd Qu.: 0.320   
##  Max.   :54.000   Max.   :32.0000      Max.   :10.010   
##                                                         
##          IncomeRange    StatedMonthlyIncome LoanMonthsSinceOrigination
##  $50,000-74,999:23696   Min.   :     0.2    Min.   : 0.00             
##  $25,000-49,999:21422   1st Qu.:  3583.3    1st Qu.: 4.00             
##  $100,000+     :13980   Median :  5041.7    Median :11.00             
##  $75,000-99,999:13547   Mean   :  6003.0    Mean   :15.95             
##  $1-24,999     : 3578   3rd Qu.:  7250.0    3rd Qu.:25.00             
##  Not employed  :    1   Max.   :483333.3    Max.   :56.00             
##  (Other)       :    0                                                 
##  LoanOriginalAmount   Investors   
##  Min.   : 1000      Min.   :   1  
##  1st Qu.: 4000      1st Qu.:   1  
##  Median : 8000      Median :  32  
##  Mean   : 9295      Mean   :  70  
##  3rd Qu.:14603      3rd Qu.: 100  
##  Max.   :35000      Max.   :1189  
## 

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.1349  0.1845  0.1936  0.2549  0.3600

First, I would like to look at the borrower’s interest rate. The distribution looks slightly positively skewed. The middle half of the rates falls between about 0.135 and 0.255, the first and third quartiles in the summary above. There are also a lot of local peaks at 0.14, 0.26, and 0.32. Let’s transform the data to see if it makes things clearer.

After transforming the data, there are still a few local peaks, but they are less drastic than in the prior graph. I wonder why there are such big spikes at particular rates. Let’s compare it with the lender yield.

The distribution of LenderYield looks similar to the distribution of BorrowerRate, with the most common yield around 0.09. There are also big spikes at 0.14, 0.23, and 0.30, very similar to the spikes in BorrowerRate. The lender’s yield should be the borrower’s interest rate minus the servicing fee, so it makes sense that it is a bit lower than BorrowerRate (about 0.01 lower at every quartile in the summary statistics above) but still very close. Let’s also look at the estimated return.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.18160  0.07408  0.09110  0.09553  0.11500  0.26670

The distribution of the estimated return looks rather different from BorrowerRate and LenderYield. It’s a bit skewed to the right, with the highest count around 0.075. There are a few loans with a negative estimated return. Let’s zoom into the negative returns.

Among the loans with a negative estimated return, the most common value is around -0.025. From the previous table, I see that the lowest estimated return is -0.1816. I wonder if those are defaulted loans or older loans that were charged off, so I would like to know their loan status. First, let’s look at loan status for all loans.

## 
##              Cancelled             Chargedoff              Completed 
##                      0                   4445                  17690 
##                Current              Defaulted FinalPaymentInProgress 
##                  51174                    885                    187 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     14                    714                    241 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    322                    275                    277

Most of the loans are either current or completed, and relatively few are past due. It’s unclear what Completed actually represents, though. Charged off means that the loan is more than 120 days past due and the balance is due in full. Now let’s look at loan status for the loans with negative estimated returns.

The loans with a negative estimated return are all charged off, completed, or defaulted, with the majority being completed. I wonder why loans with a negative estimated return would be funded in the first place.
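As a rough sketch of how this breakdown could be produced, the snippet below filters to negative estimated returns and tabulates the remaining statuses. The data frame name `loans` and the toy rows are assumptions for illustration, not the actual data.

```r
# Toy stand-in for the loan data; the values are made up for illustration.
loans <- data.frame(
  LoanStatus = factor(c("Completed", "Chargedoff", "Current",
                        "Completed", "Defaulted", "Current")),
  EstimatedReturn = c(-0.02, -0.18, 0.09, -0.03, -0.05, 0.11)
)

# Keep only loans with a negative estimated return.
negative_returns <- subset(loans, EstimatedReturn < 0)

# Count loan statuses, dropping levels that no longer occur in the subset.
status_counts <- table(droplevels(negative_returns$LoanStatus))
print(status_counts)
```

Calling `droplevels()` keeps statuses with zero loans (such as Current in this toy example) out of the table.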

The debt to income ratio ranges from 0 to 10.01, and the middle half of the ratios falls between 0.15 and 0.32. The ratio is capped at 10.01, as any ratio greater than 1000% is recorded as 1001%. I’m curious to see what the loan amount and stated income were for the borrowers at the cap.

Above is the subset of the loan original amount and stated monthly income for borrowers with a debt to income ratio of 10.01. Although this group of borrowers had a very low stated monthly income of less than $10, their original loan amounts were far above 1000% of their monthly income. How were these borrowers able to get a loan? Let’s compare with borrowers with the lowest debt to income ratio.
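A minimal sketch of pulling out that subset, assuming the data frame is called `loans` (a hypothetical name) and using invented rows:

```r
# Toy rows; the two capped borrowers are invented for illustration.
loans <- data.frame(
  DebtToIncomeRatio   = c(0.17, 10.01, 0.25, 10.01),
  LoanOriginalAmount  = c(9425, 4000, 15000, 2500),
  StatedMonthlyIncome = c(3083, 8.33, 9583, 0.25)
)

# Borrowers at the cap, keeping only the two columns of interest.
capped <- subset(loans, DebtToIncomeRatio >= 10.01,
                 select = c(LoanOriginalAmount, StatedMonthlyIncome))
print(capped)
```

Using `>=` rather than `==` avoids relying on exact floating-point equality at the cap.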

Borrowers with the lowest debt to income ratio actually had lower loan amounts and higher stated income compared to borrowers with the highest debt to income ratio. Now let’s look at the loan amounts.

## loan_cut
##   (1e+03,5e+03]   (5e+03,1e+04] (1e+04,1.5e+04] (1.5e+04,2e+04] 
##           28881           21710           16243            4698 
## (2e+04,2.5e+04] 
##            3429
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    8000    9295   14603   35000

Most loan amounts are under $11,000, and the highest amount is $35,000. The median amount is $8,000.
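The `loan_cut` table above looks like the output of `cut()` with $5,000-wide bins. The sketch below assumes those break points (they are inferred from the interval labels, not taken from the original code) and uses made-up amounts.

```r
# Made-up loan amounts, including the boundary cases.
amounts <- c(1000, 3001, 5000, 9425, 10000, 15000, 22000, 35000)

# Assumed break points, inferred from the interval labels in the table.
loan_cut <- cut(amounts, breaks = c(1000, 5000, 10000, 15000, 20000, 25000))
print(table(loan_cut))

# Intervals are left-open, so an amount of exactly 1000 falls outside the
# first bin, and anything above 25000 falls outside the last one.
print(sum(is.na(loan_cut)))
```

If the bins were built this way, loans of exactly $1,000 and loans above $25,000 would be missing from the table. Passing `include.lowest = TRUE` and extending the breaks to 35000 would cover them.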

Transforming the data shortened the tail, but it doesn’t show a significant improvement. There are still a lot of local peaks, even on the log scale.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.4000   0.6400   0.9475   1.1500 208.3300

Here I created a new variable measuring the ratio of stated monthly income to the original loan amount. I am interested in seeing how large a loan was compared to the borrower’s income. The most common ratio is around 0.33, and the middle half of the ratios falls between 0.40 and 1.15.
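The new variable is a simple element-wise division. Here is a minimal sketch, assuming a data frame named `loans` (hypothetical) and the column name IncomeLoanRatio that is used later in this report; the four rows are illustrative.

```r
# A few illustrative rows.
loans <- data.frame(
  StatedMonthlyIncome = c(3083, 6125, 2083, 9583),
  LoanOriginalAmount  = c(9425, 10000, 3001, 15000)
)

# Ratio of stated monthly income to the original loan amount.
loans$IncomeLoanRatio <- loans$StatedMonthlyIncome / loans$LoanOriginalAmount

print(summary(loans$IncomeLoanRatio))
```

A ratio below 1 means the loan is larger than one month of stated income. A ratio of 0.5 corresponds to borrowing two months’ worth.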

Overall, most of the borrowers make between $25,000 and $74,999. This makes sense, as people with low to average income are more likely to require a loan. However, there are also borrowers with income over $100,000. Maybe they need a loan to make an investment? Let’s look at the loan listing categories that the borrowers selected.

The most common category is Debt Consolidation, which overshadows every other category. The next most common categories are Other, Home and Business. Now let’s examine some other categorical variables.

## 
##      Employed     Full-time Not available  Not employed         Other 
##         65884          7584             0             1          2194 
##     Part-time       Retired Self-employed 
##           199           320            42

Most of the borrowers reported being employed, but it is difficult to determine what that actually means if they are not employed full-time, part-time, or self-employed. Did the borrowers declare that they are employed just to avoid harming their chances of getting a loan approved?

Most borrowers selected Other as their occupation, with Professional being the second most common. Surprisingly, Computer Programmer is the fourth most common, just behind Executive. Programmers usually make a decent salary, so I wonder if there are other factors causing them to take out loans.

The loans were assigned a Prosper rating when the listing was created, based on the estimated average annualized loss rate range. The highest rating (AA) and the lowest rating (HR) have rather similar counts. The most common rating is C, at about 16,500 loans.

The ProsperScore is a custom risk score which estimates the probability of a loan going 60+ days past due within the first year. The highest score (11) is the lowest risk and the lowest score (1) is the highest risk. Most of the risk scores are between 4 and 9, which would be medium risk. The distribution looks slightly positively skewed. The score is based on other variables such as debt to income ratio. The score is also used to determine the Prosper rating shown in the prior graph, so there is a relationship between the two variables.

The original data was in months, and there were only three choices, so I changed it to years for better readability. The terms are 1, 3, or 5 years, with 3 years being the most popular. I wonder how long the loans actually last.
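One way to make this conversion is to divide by 12 and treat the result as a factor, so that a bar chart shows exactly three bars. The vector below is a made-up sample of Term values.

```r
# Made-up sample of loan terms in months.
term_months <- c(36, 36, 60, 12, 36)

# Convert to years and store as a factor with the three possible terms.
term_years <- factor(term_months / 12, levels = c(1, 3, 5))

print(table(term_years))
```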

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00   11.00   15.95   25.00   56.00

Although the most common term is 3 years (36 months), the most common duration of the loan is around 2 months. Most loans are within 3 years though.

Most borrowers have other open credit lines, with 7 being the most common number. The distribution looks roughly normal if we remove the highest numbers of open credit lines. I think that there is likely a relationship between the number of open credit lines, debt to income ratio, loan status, and current delinquencies.

The distribution is positively skewed, with 0 being the most common number of current delinquencies (the median and third quartile in the summary above are both 0).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1      32      70     100    1189

The distribution is positively skewed. The most common number of investors is 1, and most loans have fewer than 50 investors. Let’s transform the long tail to read this better.

There is still an overwhelming number of loans with 1 investor. The mean is about 70 investors per loan, while the median is 32.

Univariate Analysis

What is the structure of your dataset?

The data set contains information about 114,000 loans with 81 features.

ProsperRating and ProsperScore are ordered factor variables with the following levels:

(worst) --> (best)
ProsperRating: HR, E, D, C, B, A, AA
ProsperScore: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
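In the raw data the rating is an unordered factor with alphabetical levels, so the order has to be imposed explicitly. Here is a sketch of one way to do that, using a made-up vector of ratings.

```r
# Made-up ratings for illustration.
ratings <- c("A", "HR", "C", "AA", "B")

# Impose the worst-to-best order from the list above.
prosper_rating <- factor(ratings,
                         levels = c("HR", "E", "D", "C", "B", "A", "AA"),
                         ordered = TRUE)

# Ordered factors support comparisons and max/min.
print(prosper_rating > "C")
print(max(prosper_rating))
```

With an ordered factor, plots and tables list the ratings from HR to AA instead of alphabetically.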

Most borrowers’ interest rates are between 13% and 25%. The median loan amount is $8,000. Most loans have a Prosper rating of D or higher. Most loans have a Prosper score of 4 or higher. The mean time since origination is 15.95 months.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in this dataset are loan status, loan original amount, and estimated return. I’d like to determine if the Prosper rating is effective in predicting the status or outcome of a loan.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Prosper score, debt to income ratio, current delinquencies, and open credit lines are likely to contribute to the status of a loan.

Did you create any new variables from existing variables in the dataset?

I created a new variable using StatedMonthlyIncome and LoanOriginalAmount to generate a ratio of income vs loan amount. I was curious about how much people borrowed compared to their stated income.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

BorrowerRate, LoanOriginalAmount, and Investors had unusually long-tailed distributions. I used log10 to transform the data to shorten the long tails in hopes of getting something that resembled a normal distribution.
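To illustrate what the log10 transform does to a long tail, here is a small sketch on made-up investor counts. The gap between mean and median is used only as a crude indicator of how far the tail pulls the mean. It is not a formal skewness measure.

```r
# Made-up investor counts with one very large value.
investors <- c(1, 1, 1, 32, 70, 100, 1189)

log_investors <- log10(investors)

# Crude indicator of tail pull: distance between mean and median.
gap_raw <- mean(investors) - median(investors)
gap_log <- mean(log_investors) - median(log_investors)
print(c(raw = gap_raw, log10 = gap_log))

# Histogram bin counts on the log scale, computed without drawing a plot.
print(hist(log_investors, plot = FALSE)$counts)
```

Values of exactly 1 map to 0 on the log scale, so a spike at 1 investor survives the transform as a spike at 0.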

Bivariate Plots Section

I took a random sample of the data and created a scatterplot matrix to look at the correlation between a small subset of variables. There is a very high correlation between borrower’s interest rate and estimated return, but that is expected as the estimated return is calculated from the borrower’s rate. Borrower’s interest rate, estimated return, and monthly income are moderately correlated with loan amount.
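The sampling step and the numeric counterpart of a scatterplot matrix can be sketched as follows. The simulated data, the seed, and the sample size of 200 are all assumptions for illustration. The simulated relationships only mimic the direction of those described above.

```r
set.seed(42)  # arbitrary seed so the sample is reproducible

# Simulated stand-in for the loan data.
n <- 1000
loans <- data.frame(
  BorrowerRate        = runif(n, 0.04, 0.36),
  StatedMonthlyIncome = rlnorm(n, meanlog = 8.5, sdlog = 0.5)
)
loans$EstimatedReturn    <- 0.5 * loans$BorrowerRate + rnorm(n, sd = 0.01)
loans$LoanOriginalAmount <- 2 * loans$StatedMonthlyIncome + rnorm(n, sd = 3000)

# Draw a random sample of rows, then compute pairwise correlations.
sample_rows <- sample(nrow(loans), 200)
loan_sample <- loans[sample_rows, ]
cor_matrix  <- cor(loan_sample)
print(round(cor_matrix, 2))
```

Sampling keeps a scatterplot matrix readable and fast to draw. The correlation matrix gives the same pairwise view in numbers.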

There appears to be a slight upward trend: as stated monthly income increases, the loan original amount also increases slightly. I used a smoother to confirm that there is indeed a slight positive trend. However, there are a lot of vertical bands where loans have the same loan amount with different incomes. There seems to be a higher number of loans at each $5,000 interval.

I don’t see any strong correlation between Debt To Income Ratio and Loan Original Amount from this graph. Most borrowers have a Debt To Income Ratio of 0.30 and below, regardless of loan amount.

At higher end of loan amounts, the range of estimated return is smaller compared to the lower end of loan amounts. Larger loans have lower returns, so there must be something else that attracts lenders to fund larger loans. The loans with negative estimated returns tend to be below $10,000. Now let’s look at the categorical variables.

Most borrowers only have 1-2 current delinquencies regardless of the number of open credit lines. The number of current delinquencies tends to decrease as the number of credit lines increases. I was expecting that the number of delinquencies would increase as the number of open credit lines increases.

The loan amount tends to increase as the quality of the listing score and rating increases, and then decreases at the highest quality. I’m not sure why that is, as I would imagine that the highest rated listings would ask for the highest amounts. The mean loan amount for ratings of C and higher seems to stay around $10,000. Higher rated loans have a wider range and a higher median loan amount, which is expected.

Loans that were charged off or defaulted have a similar spread. The loans that are current, completed, or charged off have a lot more outliers than the ones that are past due.

The median and range of returns for past due loans look quite consistent. Charged off and defaulted loans have the highest estimated return, and some of the widest ranges. Completed loans have the widest range overall, including negative returns.

The median score is between 5 and 6, which is in the middle range. The range of scores is pretty even across all statuses, except for past due over 120 days, which is narrower.

Students have the lowest median loan amounts and range. Borrowers who are dentists, doctors, and nurses (RN) have the widest loan amount ranges. Doctors also have the highest median loan amount.

The highest loan amounts are related to business, children, and debt consolidation. Green loans have a low average loan amount, but the range between the 1st and 3rd quartile is comparable to the categories mentioned above. Student loans have the lowest median loan amount.

Student loans have the lowest median loan amount but the highest median estimated return. Debt consolidation loans have the highest loan amounts but the lowest median estimated return. They also have the most negative estimated return.

The median and range of estimated return tend to decrease as the rating of the loan improves. This makes sense, as riskier investments tend to offer a higher return to compensate for the higher risk.
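The per-rating medians and spreads that a boxplot shows can also be computed directly with `tapply()`. The sketch below uses a hypothetical `loans` data frame with invented values for three of the seven ratings.

```r
# Invented values for three ratings, ordered worst to best.
loans <- data.frame(
  ProsperRating = factor(c("HR", "HR", "C", "C", "AA", "AA"),
                         levels = c("HR", "C", "AA"), ordered = TRUE),
  EstimatedReturn = c(0.14, 0.12, 0.09, 0.10, 0.05, 0.06)
)

# Median and interquartile range of estimated return within each rating.
median_return <- tapply(loans$EstimatedReturn, loans$ProsperRating, median)
iqr_return    <- tapply(loans$EstimatedReturn, loans$ProsperRating, IQR)

print(median_return)
print(iqr_return)
```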

Surprisingly, student loans have the highest median borrower rate, even though they have the lowest median loan amount. They also have the largest IQR. It looks like the borrower rate is about two times the estimated return.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Estimated return tends to fall as the loan amount increases. Not surprisingly, there is a positive correlation between loan amount and the quality of the loan listing (measured by the Prosper score and rating). However, the loan amount decreases for the highest rated listings.

I was shocked to see that charged off and defaulted loans had the lowest median and IQR.

Most loan amounts are under $15,000, with debt consolidation being the most common category.

Student loans have the lowest median loan amount, yet they have the highest median estimated return and borrower rate.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

It was surprising to find that the number of delinquencies decreases as the number of open credit lines increases. I was thinking that it would be the opposite. I figured that when someone borrows more against their credit, there would be a higher chance of falling behind on payments and becoming delinquent on their debt. Perhaps the borrowers with a higher number of open credit lines know how to leverage their credit and manage their debt better.

What was the strongest relationship you found?

The strongest relationship was between the estimated return and the loan rating, and between loan amount and loan rating. The loan rating can be used in a model to predict the loan amount.

Multivariate Plots Section

Similar to previous observations, the highest quality listings have higher loan amounts, while the lowest quality listings tend to have a higher income to loan amount ratio. I can see how people with lower income might state a higher income to increase their chances of getting their loan funded. Most people are borrowing just a month or two’s worth of their stated monthly income. Unfortunately, there is no data on verified income amounts, so the stated monthly income could be made up. I still see a drastic drop in AA listings above the $25,000 loan mark. Now let’s take a look at loan categories.

Recalling from earlier graphs, student loans had the lowest median loan amount, yet this graph shows that they have a moderate range of income to loan ratio. I was thinking borrowers with lower income would be more likely to borrow a higher amount compared to what they were making, but that doesn’t seem to be the case.

I did a log transformation to see if it would make the distribution of the different listing ratings clearer. It looks like most income to loan ratios fall below the 2.5 mark, except for debt consolidation and home loans.

It looks like there is a positive correlation between stated monthly income and loan amount. I added a smoother line to fit the data, which indicates that for most statuses, the positive correlation weakens as the stated monthly income goes past $10,000.

If I look at the debt to income ratio instead, most of the loans that are past due have a ratio within 0.5, meaning those borrowers’ debt is at most about half of their income.

Similar to earlier observations, this graph shows that as the quality of the listing increases, the borrower rate decreases. However, I can also see that within each color band of ratings, the borrower rate actually increases as the loan amount increases. Now let’s look at loans with negative estimated return.

There is a negative correlation between estimated return and loan amount, so nothing new there. Let’s zoom into the negative returns.

It’s interesting to see that the loan categories that have negative estimated returns are concentrated in debt consolidation, home, auto, business, and student loans. Negative returns are mostly limited to loans below $10,000, but they span all loan ratings and are not limited to low rated loans. There are a good number of B rated loans with a negative estimated return.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Most of the variables that I explored (IncomeLoanRatio, DebtToIncomeRatio, BorrowerRate, EstimatedReturn) show a negative correlation with LoanOriginalAmount, which is consistent with previous findings. StatedMonthlyIncome is the exception, with a positive correlation.

Were there any interesting or surprising interactions between features?

Although most of the variables I explored showed a negative correlation as expected, there was one interesting interaction between the loan rating, the borrower interest rate, and the loan amount. Borrower rate follows the general negative trend as the loan amount increases, but within each loan rating, the lowest loan amount has the lowest borrower rate. The borrower rate increases as the loan amount increases, so there is a slight positive correlation within each rating.
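This pattern, a negative overall correlation with a positive correlation inside each group, can be checked by computing the correlation once on all rows and once per rating. The toy data below is constructed to show the effect and is not drawn from the loan data.

```r
# Constructed toy data: better ratings get larger loans at lower rates,
# but within each rating the rate rises with the amount.
loans <- data.frame(
  ProsperRating = rep(c("HR", "C", "AA"), each = 4),
  LoanOriginalAmount = c(1000, 2000, 3000, 4000,
                         6000, 8000, 10000, 12000,
                         15000, 20000, 25000, 30000),
  BorrowerRate = c(0.30, 0.31, 0.32, 0.33,
                   0.18, 0.19, 0.20, 0.21,
                   0.06, 0.07, 0.08, 0.09)
)

# Correlation across all loans.
overall <- cor(loans$LoanOriginalAmount, loans$BorrowerRate)

# Correlation within each rating.
within <- sapply(split(loans, loans$ProsperRating),
                 function(g) cor(g$LoanOriginalAmount, g$BorrowerRate))

print(overall)
print(within)
```

The overall value is negative while every within-rating value is positive, because the rating drives both the loan amount and the rate level.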


Final Plots and Summary

Plot One

Description One

In general terms, the loan amount increases as the stated monthly income increases across all loan ratings. Yet the loans with the highest rating (AA) are not necessarily associated with the highest loan amounts or the highest stated monthly incomes. Loans with the second and third best ratings (A and B) dominate the highest loan amounts (> $25,000).

Plot Two

Description Two

Only Auto, Business, Debt Consolidation, Home, Other, and Student loans have negative estimated returns. Loan ratings do not seem to matter, as there is fair representation from each rating level.

Plot Three

Description Three

This plot shows a general negative correlation between borrower’s interest rate and the loan amount. However, there is a slight positive correlation when the loan rating is examined. For each rating (HR is the worst, AA the best), the borrower’s rate slightly increases as the loan amount increases. Also, the range of borrower interest rates becomes narrower as the loan amount increases, so the borrower rate for lower quality loans may be influenced by other factors, but that influence lessens as the loan rating increases.

Reflection

The data set contains information about 114,000 loan listings across 81 variables, so the first challenge was selecting variables to investigate. Then, I started examining each variable to understand it better and come up with more questions about the data.

Most variables had a negative correlation with loan amount, so I struggled with finding relationships that were not so obvious. I did find that although the borrower interest rate has a negative correlation with loan amount, there is a slight positive correlation within each rating.

I expected that stated monthly income would have a positive correlation with the loan amount, but I didn’t expect the correlation to weaken once the stated monthly income amount reaches a certain point.

It was difficult to decide on the right graphs to plot for the different variable combinations. It was even harder to interpret the plots and come up with meaningful insights. Some plots just weren’t interesting, and I had a hard time deciding whether to keep the plots and state the obvious or drop the plots altogether.

For some variables, data was only available for listings that were created after mid-2009, so I had to omit the listings with missing data.
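A minimal sketch of dropping incomplete listings with `complete.cases()`. The data frame name `loans` is an assumption, and the four rows reuse the first few values shown in the structure listing at the top of this report.

```r
# First four values of three columns from the structure listing.
loans <- data.frame(
  BorrowerRate    = c(0.158, 0.092, 0.275, 0.0974),
  EstimatedReturn = c(NA, 0.0547, NA, 0.06),
  ProsperScore    = c(NA, 7, NA, 9)
)

# Keep only the rows where every selected variable is present.
complete_loans <- loans[complete.cases(loans), ]

print(nrow(complete_loans))
```

Because this keeps only rows with no missing values, restricting the data frame to the variables of interest first avoids discarding rows for gaps in unused columns.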

I think that the analysis could be improved if verified income were added to the data set. There is a variable indicating whether borrowers have the required documentation to support their income, but I think that it would be interesting to see how their stated income matches up with their actual income. Some borrowers stated a monthly income of more than $50,000, which I think is highly unlikely and would skew the data too much.